Analysis of the Chemical Properties of Red Wine by Andrew Bryant

The dataset I use has data on the quality and chemical properties of red wine. The dataset was already tidy, so no cleaning was necessary before beginning analysis. I chose this dataset because…well…I like red wine!

Structure of the dataset

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

The dataset has 1,599 observations and 13 variables, one of which (X) is simply a row index. The remaining variables measure chemical properties of the wines, as well as each wine’s quality as rated by a panel of wine experts. All variables are stored as numeric or integer values.
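The listing above looks like the output of `str()` and `head()`. The sketch below shows how such a listing could be produced. The file name in the comment is an assumption (the report does not state it), so the example rebuilds the first ten rows of a few columns from the values printed above instead of reading a file.

```r
# Assumption: the full data would be loaded with something like
#   wine <- read.csv("wineQualityReds.csv")   # hypothetical file name
# Here the first ten rows of a few columns are re-typed from the listing above.
wine_head <- data.frame(
  X       = 1:10,
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)    # column types: int for X and quality, num for the rest
head(wine_head)   # first six rows

# X only numbers the rows, so it can be dropped before analysis.
wine_head$X <- NULL
```

Dropping `X` early keeps it out of later summaries and correlations, where a row counter carries no chemical meaning.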

Univariate Plots Section

Histograms of every variable

Summary Statistics

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The most important variable to me is the quality of the wine. People won’t buy or not buy a wine based on its citric acid content, but they will based on its quality. Examining this variable is a good place to start.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Most wines have a quality rating of either 5 or 6.
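As a quick check, the quality column can be rebuilt from the counts in the table above, which is enough to reproduce the summary and to put a number on "most". This is a sketch that uses only the printed counts, not the original data file.

```r
# Rebuild the quality column from the counts in the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

length(quality)    # 1599 observations
summary(quality)   # same quartiles and mean as the summary above
table(quality)

# Share of wines rated 5 or 6
share_5_or_6 <- mean(quality %in% c(5, 6))
round(share_5_or_6, 3)   # 0.825
```

About 82 percent of the wines are rated 5 or 6, so the other four ratings together cover fewer than one wine in five.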

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

From the histogram we can see that the distribution of the alcohol content of the wines is skewed to the right. There is a long tail of wines with alcohol contents well above the median.

The boxplot reveals that there are some outliers above an alcohol content of 13 percent.
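To see where the outlier cutoff falls, here is a minimal sketch of the conventional 1.5 × IQR rule applied to the alcohol quartiles from the summary statistics. That the boxplot in this report used the default 1.5 × IQR whiskers is an assumption.

```r
# Quartiles of alcohol taken from the summary statistics earlier in this report.
q1 <- 9.5
q3 <- 11.1

iqr         <- q3 - q1          # interquartile range, about 1.6
upper_fence <- q3 + 1.5 * iqr   # about 13.5
lower_fence <- q1 - 1.5 * iqr   # about 7.1

c(lower = lower_fence, upper = upper_fence)
```

Under this rule, points are flagged above roughly 13.5 percent alcohol. The minimum of 8.4 sits inside the lower fence, so all of the outliers are on the high side.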

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH is approximately normally distributed. The summary shows that the mean and median of this variable are almost identical. Wine is acidic, meaning that it has a pH of less than 7.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

There are many high outliers in sulphates, and the data are skewed to the right. Faceted by quality, the shapes of the boxplots differ, as does the number of outliers. Qualities 5 and 6 appear to have more outliers, while higher and lower qualities have fewer. This is partly expected, because those two ratings account for most of the wines.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

In a boxplot of citric acid, the spread of the data looks relatively innocuous, but the histogram shows that the data are all over the place. There are two spikes around 0.0 and 0.5, with the data rising and falling in between. The boxplot shows almost no outliers: by the usual 1.5 × IQR rule the upper fence is 0.42 + 1.5 × 0.33 = 0.915, so only values near the maximum of 1.0 fall outside it.

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

I am interested in quality; what goes into making a good bottle of wine? How does the chemical composition of wine differ between the bad, average, and great wines?

Something that was cool was faceting variables by quality while making charts and seeing if any differences appeared.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Don’t know. However, I may have to learn more than I ever thought I’d learn about the chemical composition of wine to see which variables most influence the quality of wine.

Did you create any new variables from existing variables in the dataset?

I haven’t created any new variables yet; I just explored existing ones.

Of the features you investigated, were there any unusual distributions?

The distribution of the citric acid variable was unusual. All others were either normally distributed or skewed to the right.

Bivariate Plots Section

Alcohol and Quality

The million dollar question: Does better wine get you more drunk?

There looks to be a positive correlation between the two variables. That means that higher quality wines appear to have higher alcohol contents.

Alcohol and pH content

## [1] 0.2056325

There is a weak positive correlation between alcohol and pH (about 0.21).
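The value above is presumably Pearson's correlation coefficient, which is what R's `cor()` returns by default. The sketch below computes it by hand and with `cor()`, using only the first ten pH and alcohol values printed in the structure listing. With so few rows the result will not match the coefficient for the full dataset; the point is only to show what the number measures.

```r
# First ten values of each column, as printed in the structure listing.
pH      <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Pearson's r by hand: sum of cross-products of the centered values,
# divided by the square root of the product of their sums of squares.
r_manual <- sum((pH - mean(pH)) * (alcohol - mean(alcohol))) /
  sqrt(sum((pH - mean(pH))^2) * sum((alcohol - mean(alcohol))^2))

# Built-in equivalent (method = "pearson" is the default).
r_builtin <- cor(pH, alcohol)

all.equal(r_manual, r_builtin)   # TRUE
```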

Density and Alcohol

## [1] -0.4961798

There is a moderate negative correlation between the two variables, with a coefficient of about -0.5. There are cases where a wine with a higher density has more alcohol than one with a lower density, but the general trend is that wines with more alcohol are less dense. This makes sense because alcohol is less dense than water.

pH and Density

## [1] -0.3416993

There is a moderate negative correlation between these variables (about -0.34).

pH and Quality

## [1] -0.05773139

Wines with a lower pH are more acidic. The presence of acids in the wine influences taste, so more or less acid in the wine may influence its perceived quality. However, the data show essentially no correlation between the two variables (about -0.06). The level of acid in the wine, as measured by pH, does not seem to affect its quality.

Faceted by quality, the normal distribution seen in the univariate histogram remains. Wines of each quality level have pH levels that are approximately normally distributed, with pH values that all fall in similar ranges.

Citric Acid and Quality

## [1] 0.2263725

There are ‘bars’ of quality in this graph. They look cool, but they are just a side effect of applying the ‘jitter’ parameter to points whose quality ratings are all integers.
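The bars can be reproduced in miniature. The sketch below uses base R's `jitter()` on the first ten quality ratings from the structure listing. Whether the original plot used this function or a plotting-library equivalent is not stated in the report, and the amount of 0.3 is an arbitrary choice for illustration.

```r
set.seed(1)  # only to make the random offsets reproducible

# First ten quality ratings, as printed in the structure listing.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Without jitter every wine sits exactly on an integer, so points overplot.
unique(quality)

# jitter() adds uniform noise in [-amount, amount] to each value.
quality_jittered <- jitter(quality, amount = 0.3)

# Each point stays within 0.3 of its true rating, which is what forms the bars.
max(abs(quality_jittered - quality)) <= 0.3
```

Because the noise is smaller than half the gap between ratings, rounding the jittered values recovers the original ratings, so the bars never overlap.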

Zooming into the graph by subsetting on quality, the citric acid content is lower for the lower quality wines. This could be due to the raters’ individual preferences, or due to something more structural about the chemical composition of the wine.

Sulphates and Quality

## [1] 0.2513971

There is a weak positive correlation (about 0.25) between the level of sulphates in a wine and its quality.

Fixed & Volatile Acidity and Quality

## [1] 0.1240516

## [1] -0.3905578

There is a weak positive correlation between fixed acidity and quality (about 0.12). When you remove the highest and lowest rated wines, that relationship becomes clearer.

Volatile acidity refers to the presence of steam-distillable acids in the wine. Wine spoilage is measured by volatile acidity. Higher quantities of volatile acidity may indicate spoilage, and thus reduce the quality of the wine. That could explain the negative correlation seen between volatile acidity and quality.

The legal limit for volatile acidity in red wine in the United States is 1.4 grams per liter (1.2 grams per liter applies to other grape wines). Sure enough, almost all of the wines have volatile acidities below this amount.

Sulfur Dioxide and Alcohol

## [1] -0.06940835

## [1] -0.2056539

There is a weak negative correlation between sulfur dioxide and alcohol (the two values above, about -0.07 and -0.21). Upon further research, however, it doesn’t appear that the factors are linked. Some sulfur dioxide is produced during the fermentation process, but most of the sulfur dioxide in wine is added by winemakers as a preservative. It doesn’t play a role in the creation of alcohol.

Bivariate Analysis

Talk about some of the relationships you observed

I saw a relationship between volatile acidity and quality, as well as among pH, alcohol, and density. I also saw a relationship between alcohol content and quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between density and alcohol content, as well as between pH and density.

What was the strongest relationship you found?

Between density and alcohol content. The more alcohol a wine has, the less dense it is. This is because alcohol is less dense than water.
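To compare the relationships on one scale, the coefficients reported in the bivariate section can be ranked by absolute value. This sketch only reuses the printed numbers. Two assumptions are made: the two acidity outputs are assigned in the order the heading lists them (fixed, then volatile), and the two sulfur dioxide outputs are left out because the report does not say which measure each belongs to.

```r
# Coefficients copied from the outputs in the bivariate section.
# The alcohol ~ quality coefficient is not reported there, so it is left out.
r <- c(
  "alcohol ~ pH"               =  0.2056325,
  "density ~ alcohol"          = -0.4961798,
  "pH ~ density"               = -0.3416993,
  "pH ~ quality"               = -0.05773139,
  "citric.acid ~ quality"      =  0.2263725,
  "sulphates ~ quality"        =  0.2513971,
  "fixed.acidity ~ quality"    =  0.1240516,
  "volatile.acidity ~ quality" = -0.3905578
)

# Rank by strength, ignoring sign.
ranked <- r[order(abs(r), decreasing = TRUE)]
round(ranked, 2)
names(ranked)[1]   # "density ~ alcohol"
```

Among the reported coefficients, volatile acidity and quality come second, and pH and quality come last.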

Multivariate Plots Section

Alcohol, Quality, Citric Acid

At a given level of alcohol content, there isn’t any clear relationship between citric acid and quality. The correlation is negative for the highest and lowest qualities, almost zero for quality 5, and slightly positive for quality 6.

Alcohol, Fixed Acidity, Quality

There is a bit of a relationship between alcohol and fixed acidity. For most qualities of wine, the correlation between fixed acidity and alcohol is negative.

Multivariate Analysis

Talk about some of the relationships you observed

In this section I tried to see the relationship between quality and the factors that contribute to higher alcohol content in wine. When there is less fixed acidity in the wine, there is more alcohol in the wine.

Were there any interesting or surprising interactions between features?

In this section there weren’t any findings that really jumped out at me.

OPTIONAL: Did you create any models with your dataset?

I didn’t create any models. Many factors influence the winemaking process, and I would be hesitant to create a model that says that just a handful of them can predict the quality of a wine.


Final Plots and Summary

Plot One

Description One

This is a histogram of the ratings the wine experts gave the wines in the dataset. I chose this plot because it is the variable that I wanted to find out more about. What makes wine good?

Plot Two

Description Two

I decided to show this plot as a final plot because alcohol is a variable that is correlated with quality in a dataset where there weren’t many variables correlated with quality. It could be used to further investigate factors that make wine good.

Plot Three

Description Three

This is a scatterplot of alcohol content and fixed acidity. There are linear regression lines for each level of wine quality. This chart shows a variable that may influence the level of alcohol, which we have shown is positively correlated with quality. This chart shows a starting point for those interested in further exploring what factors influence the quality of wine.


Reflection

After analyzing this dataset, I’ve come to realize that winemaking is more art than science. There are thousands of factors that go into the flavor of a bottle of wine, most of which are not captured in the dataset. Also, people perceive taste differently and prefer some tastes over others. Had another group of experts rated the wines, we might have different results.

This is shown in the dataset by the lack of correlation between quality and other factors, such as acidity. Acidity is one of the factors that influences a wine’s taste, but for each level of pH or fixed acidity, there isn’t a clear relationship between those variables and quality. There was a correlation between quality and alcohol content, and between fixed acidity and alcohol, but I think that there are other factors that influence this relationship. You can’t just dump a bunch of alcohol in a batch of wine or add acid to the wine to make it taste better! So, I wouldn’t go so far as to say that more alcohol or acid makes wine taste better.

I’d also be hesitant to rely much on any mathematical model to judge whether wine is good or not, because of the number of factors and subjective nature of quality.

One way this dataset could be improved is by including the region in which the wine grapes were grown, as well as climate data for the growing season. Climate and region are very important factors; they affect acidity and alcohol content, which modify a wine’s taste. Adding these variables to the dataset may uncover more patterns in the interaction of the components that make up wine. The researchers who collected the data and constructed the dataset declined to include this sort of data for privacy reasons, but having access to it may yield interesting insights.

Successes during the analysis were finding some correlations. It was also fun learning a bit more about wine. However, we have to keep in mind that correlation doesn’t imply causation and that there are many factors that could cause the correlations that we see, or they could even just be random. A struggle for me was constantly reminding myself that there may be more to the correlations than the chart shows, or that they might not even mean anything.